Select an Expression Data Set

Questions and answers

  1. What are the control and test conditions of the dataset?
  1. Why is the dataset of interest to you?
  1. Were there expression values that were not unique for specific genes? How did you handle these?

  2. Were there expression values that could not be mapped to current HUGO symbols?

  1. How did you handle replicates?
  1. What is the final coverage of your dataset?

Pre-work

Download all the require packages and reference the libraries

readr allows the program to read large datasets.

biomaRt is a Bioconductor package that implements the RESTful API of biomart, the annotation framwork for model organism genomes at the EBI. It is a Bioconductor package, and as such it needs to be loaded via the BiocManager,

Reference the libraries

Load the data

Assess

Check if there is any duplicated genes or features

Mapping

Mapping via biomaRt The data that I am working contains HUGO gene symbols so I am going to map it to Ensembl ID

Take a look at the new mapped table

Clean

Check duplicates

To fix those duplicates - Find them

Make sure there is no empty values

Only 6 duplicates so we can check it on the ensembl gene portal

According to Ensembl gene portal - LINC00856 is an external reference matched to LINC00595 - C12orf74 is an external reference matched PLEKHG7 - We can remove both

Save the mapping

Normalization

Group all the subtypes/ features

Compare read counts in different breast tumour subtypes

Plot

Now, we can see the distrubution of the data

Boxplot - with data from the original data before normalization

MDS Plot - Compare between all subtypes

Density Plot

Decription of article

The of the most dynamic cell populations found in turmous are leukocytes. They play a very important role in normal breast tissue remodeling during pregnancy and involution. This study performed a gene expression analysis in leukocytes sorted from normal breast tissues, ductal carcinomas in situ (DCIS), and HER2+ and triple negative invasive ductal carcinomas (IDC). It focuses on the immune escape during breast tumor progression.

About the dataset

This study performed RNA-seq on purified CD45+ T cells from normal breast tissues (n=12),DCIS (n=11), and IDC (n=12), focusing on HER2+CD3+ (n=5) and TN (n=6) IDC cases. RNA was isolated from purified CD45+CD3+ leukocytes by cell sorting.

Source name resected breast tumor Organism Homo sapiens Characteristics tissue of origin: breast (tumor) cell type: CD45+CD3+ T cell breast tumor subtype: DCIS/ IDC/ None

Results from A1 Retrieved dataset(GSE87517) from GEOquery.Checked if there is any duplicated genes and features in the dataset. Since the dataset contains HUGO symbols so I mapped it with ENSEMBL gene id. Gathered all the information from BioManager::useMart and mapped it with our dataset. Only 6 duplicates existed in the mapping therefore I manually filtered out the duplicates by checking them on the ensembl gene portal. Normalized the data and filtered out low counts data. Look at the distribution of the dataset by Boxplot, MDS plot and Density plot. The final coverage of my dataset is 26440 out of 27011 observed values (missing ~2.2% of the dataset)

Prework

Install all the required packages

Reference the packages

Gene of interest

Look at the most common genes of the same subfamily in our dataset

CD8A - According to Entrez Gene Summary: - CD8A acts as a coreceptor with the T-cell receptor on the T lymphocyte to recognize antigens

CTLA-4 : immune checkpoint proteins - mentioned in the results of the article

Differential expression analysis

Use edgeR to perform differential expression analysis as edgeR is designed for analyzing RNA-seq data. using Quasi liklihood method to get the estimation of allowing overdispersion and this method can fit data exhibiting overdispersion using fully specified probability models

Design the model for the anaylsis

Take a look at the MDS plot

According to above MDS Plot, setting the magnitude of log-fold change of at least 1 and less than one seems reasonable.

Calculate differential expression

Make a contrast table explicitly comparing normal genes and affected (dcis and idc) genes

Plot the heatmap to compare the expressions of normal/ unaffected genes and affected genes.

Up regualted genes

Down regulated genes

Save all genes lists

Thresholded Over-representation analysis

To perform a thresholded gene set enrichment analysis, we can either go on the web interface (https://biit.cs.ut.ee/gprofiler/gost) to run g:Profiler or use the gprofiler2 package.

Using the function gost() to retrieve the functional enrichment analysis of gene lists. The annotation data and the version of the annotation I am using is:

GO:MF – releases/2019-07-01

GO:CC – releases/2019-07-01

GO:BP – releases/2019-07-01

KEGG – KEGG FTP Release 2019-09-30

REAC – annotations: ensembl

classes: 2019-10-2

WP – 20190910

TF – annotations: TRANSFAC Release 2019.1

classes: v2

MIRNA – Release 7.0

HPA – annotations: HPA website: 2017-12-01

classes: script: 2018-03-19

CORUM – 03.09.2018 Corum 3.0

HP – hpo.annotations.monthly #157

Decription of article

The of the most dynamic cell populations found in turmous are leukocytes. They play a very important role in normal breast tissue remodeling during pregnancy and involution. This study performed a gene expression analysis in leukocytes sorted from normal breast tissues, ductal carcinomas in situ (DCIS), and HER2+ and triple negative invasive ductal carcinomas (IDC). It focuses on the immune escape during breast tumor progression.

About the dataset

This study performed RNA-seq on purified CD45+ T cells from normal breast tissues (n=12),DCIS (n=11), and IDC (n=12), focusing on HER2+CD3+ (n=5) and TN (n=6) IDC cases. RNA was isolated from purified CD45+CD3+ leukocytes by cell sorting.

Source name resected breast tumor Organism Homo sapiens Characteristics tissue of origin: breast (tumor) cell type: CD45+CD3+ T cell breast tumor subtype: DCIS/ IDC/ None

Results from A1 Retrieved dataset(GSE87517) from GEOquery.Checked if there is any duplicated genes and features in the dataset. Since the dataset contains HUGO symbols so I mapped it with ENSEMBL gene id. Gathered all the information from BioManager::useMart and mapped it with our dataset. Only 6 duplicates existed in the mapping therefore I manually filtered out the duplicates by checking them on the ensembl gene portal. Normalized the data and filtered out low counts data. Look at the distribution of the dataset by Boxplot, MDS plot and Density plot. The final coverage of my dataset is 26440 out of 27011 observed values (missing ~2.2% of the dataset)

Prework

Install all the required packages

Reference the packages

  library(Biobase)
  library(ComplexHeatmap)
  library(circlize)
  library(gprofiler2)

Non-thresholded Gene set Enrichment Analysis

Non-thresholded gene set enrichment analysis is performed using GSEA v4.0.3. The text file of the ranked-list is generated from A2. The rank is calculated by log(diff_exp\(PValue,base =10) * sign(diff_exp\)logFC).

head(diff_exp)
##          x      logFC    logCPM           F       PValue          FDR
## 1     A1BG  2.9400748 2.4450740  8.82681786 3.774597e-03 0.0219439121
## 2 A1BG-AS1  7.5598692 6.6977064 24.30061395 3.568651e-06 0.0002682045
## 3     A1CF -0.1565448 0.2504336  0.02953652 8.639183e-01 0.9191069107
## 4      A2M -2.1776375 4.4929525  4.57689425 3.501582e-02 0.1052300017
## 5  A2M-AS1 -3.7651267 3.5712110 13.24770546 4.475751e-04 0.0051791160
## 6    A2ML1  1.4551196 3.5230584  1.86643993 1.751669e-01 0.3226719275
##          rank
## 1  2.42312941
## 2  5.44749588
## 3 -0.06352734
## 4 -1.45573569
## 5 -3.34913411
## 6  0.75654800
#upregulated genes
head(up_genes)
## [1] A1BG     A1BG-AS1 AADACL3  AADAT    AAK1     AANAT   
## 27011 Levels: A1BG A1BG-AS1 A1CF A2M A2M-AS1 A2ML1 A2MP1 A3GALT2 A4GALT ... ZZZ3
#downregulated genes
head(down_genes)
## [1] A2M     A2M-AS1 AADACL4 AAMDC   ABCA1   ABCA9  
## 27011 Levels: A1BG A1BG-AS1 A1CF A2M A2M-AS1 A2ML1 A2MP1 A3GALT2 A4GALT ... ZZZ3

Methods: GSEA v4.0.3 for Windows

Genesets: Genesets from the baderlab geneset collection from February 1, 2020 containing GO biological process, no IEA and pathways

Version: v4.0.3

Results:

Top 10 upregualted genes:

The complement system is made up of a large number of distinct plasma proteins that react with one another to opsonize pathogens and induce a series of inflammatory responses that help to fight infection.

Top 10 downregulated genes:

Compare results:

From A2, we perfromed a enrichment analysis using g:profiler. From there, the top upgregulated are: cellcular response to cytokine stimulus, response to cytokine, cytokine-mediated signaling pathway, cellcular response to tumor necrosis facr and organelle envelope. The result from g:profiler mostly consists of cellular and biological response to cytokine which is quite different from the results from GSEA.

Enrichment Analysis

Enrichment Analysis is perfromed using Cytoscape 3.7.2 and the parameters are the followings:

Analysis type: GSEA Enrichment pos: gsea_report_for_na_pos_1585770708151.xls ENrichment neg: gsea_report_for_na_neg_1585770708151.xls GMT: Human_GOBP_AllPathways_no_GO_iea_February_01_2020_symbol.gmt Ranks: ranked_gene_list_na_pos_versus_na_neg_1585770708151.txt Expression: normalized_data.txt Phenotype: na_pos, na_neg FDR : 0.9

All the files are used in Cytoscape are generated from GSEA except normalized_data.txt. normalized_data.txt is the differential expression that is calculated in A2 and is inputed in a text file. It consists of all the gene names and their LogFC value.

There are 463 nodes and 1469 edges showing in the network.

The geneset with the highest positive NES:

The geneset with the lowest negative NES:

Annotated cluster: Using the application in Cytoscape, AutoAnnotate, to create annotation set

The parameters are the followings: Cluster Alogrithm: MLC cluster Edge Weight column: simlarity_coefficient Label Column: GS_DESCR Label Algorithm: WordCloud: Adjacent Words Max Word per lable: 3 Adjacent word bonus: 8

The annotation network has 90 clusters. The range size of the clusters is from 2 nodes to 33 nodes.

The top 3 largest upregulated cluster: mrna translation termination il2 1pathway il nucleus trna export

1 major downregulated cluster: chemokine chemotaxis migration

Putting the annoatated network into theme Go to AutoAnnotate panel to Create Summary Network, then select Clusters and Unclustered Nodes

Now, in the network, we have 189 nodes and 33 edges. 34 of all nodes are not clustered together and the rest are clusted and linked together.

Wikipathway

From the GSEA reault, we can see MATRIX METALLOPROTEINASES pathway is in the top 10 upgregulated list. The MMP family and their inhibitor play a very important role multiple biological functions in all stages of cancer development. The stages of cancer developement include initiation to outgrowth of clinically relevant metastases and likewise in apoptosis and angiogenesis. MMPs and their inhibitors are crucital to many researchers and are investigated to be an anticancer drug

Import the network from Public database, searched the pathway by Wikipathway and entered WP129

Import the differential expression data (normalized_data.txt) to the pathway Change the color of the enzymes according to the expression level from normalized_data.txt

MMP3 and TIMP4 show the highest expression value and TCF20 shows the lowerst expression value

In the article, MMPs are not mentioned in the paper but from the results from A2, it is one of the most abundant family expressed in the dcis and IDC features. MMP3 involves in the breakdown of extracellular matrix in normal physiological processes, such as embryonic development, reproduction, and tissue remodeling, as well as in disease processes, such as arthritis and metastasis.As TIMP4 is the inhibitor of the matrix metalloproteinases which involves in degradation of the extracellular matrix. On the other hand, TCF20 is downregualted and is transcriptional coactivator and can enhance the activity of transcription factors. Mutation on this protein can lead to autism spectrum disorders which is less related to our article.